Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems

Authors

  • Haryadi S. Gunawi
  • Riza O. Suminto
  • Russell Sears
  • Casey Golliher
  • Swaminathan Sundararaman
  • Xing Lin
  • Tim Emami
  • Weiguang Sheng
  • Nematollah Bidokhti
  • Caitie McCaffrey
  • Gary Grider
  • Parks M. Fields
  • Kevin Harms
  • Robert B. Ross
  • Andree Jacobson
  • Robert Ricci
  • Kirk Webb
  • Peter Alvaro
  • H. Birali Runesha
  • Mingzhe Hao
  • Huaicheng Li
Abstract

Fail-slow hardware is an under-studied failure mode. We present a study of 101 reports of fail-slow hardware incidents, collected from large-scale cluster deployments at 12 institutions. We show that all hardware types, such as disk, SSD, CPU, memory, and network components, can exhibit performance faults. We make several important observations: faults convert from one form to another, chains of cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.

Similar resources

Post-Triassic normal faulting and extensional structures in Central Alborz, Northern Iran

This paper presents structural evidence of extensional activity in Central Alborz during the Mesozoic. Structural evidence of homogeneous early-stage stretching, such as layer-parallel to oblique boudinage of Permian and Triassic rocks in various portions of the study area accompanied by extensional-fibrous fractures, was followed by more advanced extensional features. These extensional structu...

Development of system decision support tools for behavioral trends monitoring of machinery maintenance in a competitive environment

The article centres on software system development for a manufacturing company that produces polyethylene bags using mostly conventional machines, in a competitive world where each business enterprise desires to stand tall. This is meant to assist in gaining market share and in taking maintenance and production decisions through the dynamism and flexibility embedded in the package as customers’ demand ...

Generation Scheduling in Large-Scale Power Systems with Wind Farms Using MICA

The growth in demand for electric power and the rapid increase in fuel costs throughout the world create a need to discover new energy resources for electricity production. Among the nonconventional resources, wind and solar energy are known as the most promising for electricity production in the future. In this thesis, we study long-term generation scheduling of power systems in the pre...

Communication-efficient Outlier Detection for Scale-out Systems

Modern scale-out services are built on top of large datacenters composed of thousands of individual machines. These must be continuously monitored because unexpected failures can overload fail-over mechanisms and cause large-scale outages. Such monitoring can be accomplished by periodically measuring hundreds of performance metrics and looking for outliers, often caused by misconfigurations, har...

FAIL-FCI: Versatile fault injection

One of the topics of paramount importance in the development of Grid middleware is the impact of faults, since their probability of occurrence in a Grid infrastructure and in large-scale distributed systems is actually very high. In this paper, we explore the versatility of a new tool for fault injection in distributed applications: FAIL-FCI. In particular, we show that not only are we able to ...

Publication date: 2018